1. Data Summary

The data comes from the American Community Survey. It consists of personal information about people residing in the United States of America between 2012 and 2016. This is an observational study with over 10 million records collected.

Here is a glimpse/description of the data used to perform this analysis.

The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). The PUMS dataset includes variables for nearly every question on the survey. Each record in the file represents a single person. The PUMS contains data on approximately one percent of the United States population.

The data contains 7,487,361 rows and 234 columns. Each row in this data set is the response from one person.

Since the data is quite large, I have kept only what is required for my analyses. The important variables I have chosen are as follows:

Data Description

Variable Description Data Type Data Dictionary
ST State code based on 2010 Census definitions numerical, discrete / data as numerical 01 - 72 allocated state names (01 - Alabama/AL, 02 - Alaska/AK, and so on). Consists of abbreviations.
VALP Property value numerical, discrete / data as numerical bbbbbbb N/A (GQ/vacant units, except “for-sale-only” and “sold, not occupied”/not owned or being bought) ; 0000001..9999999 $1 to $9999999 (Rounded and top-coded)
LAPTOP Laptop or desktop categorical, variable / data as numerical b-N/A (GQ/vacant);1-Yes;2-No
HISPEED Broadband (high speed) Internet service such as cable, fiber optic, or DSL service categorical, variable / data as numerical b - N/A (GQ/vacant/no paid access to the internet);1-Yes;2-No
FULP Fuel cost (yearly cost for fuels other than gas and electricity, use ADJHSG to adjust values 3 and over to constant dollars) numerical, discrete bbbb - N/A (GQ/vacant);0001 - Included in rent or in condo fee;0002 - No charge or these fuels not used ; 0003..9999 -$3 to $9999 (Rounded and top-coded).
GASP Gas (monthly cost, use ADJHSG to adjust GASP values 4 and over to constant dollars) numerical, discrete bbb -N/A (GQ/vacant);001 -Included in rent or in condo fee;002 -Included in electricity payment;003-No charge or gas not used;004..999-$4 to $999 (Rounded and top-coded).
TYPE Type of unit categorical, variable / data as numerical 1-Housing unit;2-Institutional group quarters;3-Noninstitutional group quarters.
TAXP Property taxes (yearly amount, no adjustment factor is applied) categorical, variable / data as numerical bb-N/A (GQ/vacant/not owned or being bought);01-None;02-$1 - $49;03-$50 - $99 and so on.
WIF Workers in family during the past 12 months categorical, variable / data as numerical b -N/A (GQ/vacant/non-family household);0 -No workers;1 -1 worker;2 -2 workers;3 -3 or more workers.
HINCP Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars) numerical, discrete / data as numerical N/A (GQ/vacant) ; 00000000 -No household income;-0059999 -Loss of $59,999 or more; -0059998..-0000001 -Loss of $1 to $59,998;00000001 -$1 or Break even;00000002..99999999 -Total household income in dollars (Components are rounded).
GRNTP Gross rent (monthly amount, use ADJHSG to adjust GRNTP to constant dollars) numerical, discrete / data as numerical N/A (GQ/vacant/owned or being bought/occupied without rent payment);00001..99999 -$1 - $99999 (Components are rounded).
RNTP Monthly rent (use ADJHSG to adjust RNTP to constant dollars) numerical, discrete / data as numerical N/A (GQ/vacant units, except “for rent” and “rented, not occupied”/owned or being bought/occupied without rent payment);00001..99999 -$1 to $99999 (Rounded and top-coded).
FINCP Family income (past 12 months, use ADJINC to adjust FINCP to constant dollars) numerical, discrete / data as numerical N/A (GQ/vacant) ; 00000000 - No family income ;-0059999 -Loss of $59,999 or more; -0059998..-0000001 Loss of $1 to $59,998;00000001 -$1 or Break even;00000002..99999999 -Total family income in dollars (Components are rounded)
RMSP Number of Rooms numerical, discrete / data as numerical N/A (GQ); 00..99 -Rooms (Top-coded).
NP Number of persons associated with this housing record numerical, discrete / data as numerical 00 -Vacant unit;01 -One person record (one person in household or any person in group quarters);02..20 -Number of person records (number of persons in household).
ADJINC Adjustment factor for income and earnings dollar amounts (6 implied decimal places) categorical, variable / data as numerical 1061971 -2013 factor (1.007549 * 1.05401460) ; 1045195 -2014 factor (1.008425 * 1.03646282);1035988 -2015 factor (1.001264 * 1.03468042);1029257 -2016 factor (1.007588 * 1.02150538);1011189 -2017 factor (1.011189 * 1.00000000).
ACR Lot size categorical, variable / data as numerical N/A (GQ/not a one-family house or mobile home);1 -House on less than one acre; 2-House on one to less than ten acres;3 -House on ten or more acres.
INSP Fire/hazard/flood insurance (yearly amount, use ADJHSG to adjust INSP to constant dollars) numerical, discrete N/A (GQ/vacant/not owned or being bought);00000 - None ; 00001..10000 - $1 to $10000 (Rounded and top-coded).
FTAXP Property taxes (yearly amount) allocation flag numerical, discrete N/A (GQ);No;Yes
FHINCP Household income (past 12 months) allocation flag numerical, discrete N/A (GQ);No;Yes
FINSP Fire, hazard, flood insurance (yearly amount) allocation flag numerical, discrete N/A (GQ);No;Yes
YBL When structure first built numerical, discrete N/A (GQ);1939 or earlier;1940 to 1949;1950 to 1959;1960 to 1969;1970 to 1979;1980 to 1989;1990 to 1999;2000 to 2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017.
MV When moved into this house or apartment categorical, variable / data as numerical N/A (GQ);12 months or less;13 to 23 months;2 to 4 years;5 to 9 years;10 to 19 years;20 to 29 years;30 years or more.
HHL Household language categorical, variable N/A (GQ);English only;Spanish;Other Indo-European languages;Asian and Pacific Island language;Other language;
VEH Vehicles (1 ton or less) available categorical, variable N/A (GQ);1 vehicle;2 vehicles;3 vehicles;4 vehicles;5 vehicles;6 or more vehicles.
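The column selection described above can be sketched as follows. The analysis itself is in R; this Python/pandas version uses a tiny in-memory stand-in for the real PUMS housing CSV (whose file name is not shown in the report), so the file contents here are hypothetical.

```python
import io
import pandas as pd

# Columns kept for the analysis (from the data dictionary above).
KEEP = ["ST", "VALP", "LAPTOP", "HISPEED", "FULP", "GASP", "TYPE", "TAXP",
        "WIF", "HINCP", "GRNTP", "RNTP", "FINCP", "RMSP", "NP", "ADJINC",
        "ACR", "INSP", "FTAXP", "FHINCP", "FINSP", "YBL", "MV", "HHL", "VEH"]

# Hypothetical two-row stand-in for the real ~7.5M-row PUMS file.
raw = io.StringIO("ST,VALP,NP,RT\n1,165000,2,H\n6,584500,3,H\n")

# usecols keeps memory manageable on the full file; only listed columns survive.
df = pd.read_csv(raw, usecols=lambda c: c in KEEP)
print(list(df.columns))  # ['ST', 'VALP', 'NP'] - the RT column is dropped
```

Reading only the needed columns up front is what makes a 234-column, multi-gigabyte file workable on a laptop.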

Detailed Data Information

## Time difference of 1.140622 mins
Table continues below
ST VALP LAPTOP HISPEED
Min. : 1.00 Min. : 100 Min. :1.0 Min. :1.0
1st Qu.:12.00 1st Qu.: 100000 1st Qu.:1.0 1st Qu.:1.0
Median :27.00 Median : 180000 Median :1.0 Median :1.0
Mean :27.83 Mean : 276505 Mean :1.2 Mean :1.2
3rd Qu.:42.00 3rd Qu.: 320000 3rd Qu.:1.0 3rd Qu.:1.0
Max. :56.00 Max. :6308000 Max. :2.0 Max. :2.0
NA NA’s :3139781 NA’s :1355137 NA’s :2625724
Table continues below
FULP GASP TYPE TAXP WIF
Min. : 1.0 Min. : 1.0 Min. :1.00 Min. : 1 Min. :0
1st Qu.: 2.0 1st Qu.: 3.0 1st Qu.:1.00 1st Qu.:19 1st Qu.:1
Median : 2.0 Median : 20.0 Median :1.00 Median :31 Median :2
Mean : 115.6 Mean : 46.3 Mean :1.15 Mean :34 Mean :1
3rd Qu.: 2.0 3rd Qu.: 60.0 3rd Qu.:1.00 3rd Qu.:50 3rd Qu.:2
Max. :7800.0 Max. :640.0 Max. :3.00 Max. :68 Max. :3
NA’s :1355137 NA’s :1355137 NA NA’s :3202733 NA’s :3402406
Table continues below
HINCP GRNTP RNTP FINCP
Min. : -21500 Min. : 4 Min. : 4 Min. : -21500
1st Qu.: 28600 1st Qu.: 670 1st Qu.: 520 1st Qu.: 39000
Median : 57000 Median : 940 Median : 790 Median : 70000
Mean : 80345 Mean :1071 Mean : 921 Mean : 94503
3rd Qu.: 100200 3rd Qu.:1336 3rd Qu.:1200 3rd Qu.: 116030
Max. :3209000 Max. :5022 Max. :4000 Max. :3164000
NA’s :1355137 NA’s :5761729 NA’s :5660370 NA’s :3402406
Table continues below
RMSP NP ADJINC ACR
Min. : 1 Min. : 0.000 Min. :1011189 Min. :1.0
1st Qu.: 4 1st Qu.: 1.000 1st Qu.:1029257 1st Qu.:1.0
Median : 6 Median : 2.000 Median :1035988 Median :1.0
Mean : 6 Mean : 2.105 Mean :1036534 Mean :1.3
3rd Qu.: 7 3rd Qu.: 3.000 3rd Qu.:1045195 3rd Qu.:1.0
Max. :30 Max. :20.000 Max. :1061971 Max. :3.0
NA’s :740715 NA NA NA’s :2170040
Table continues below
INSP FTAXP FHINCP FINSP YBL
Min. : 0 Min. :0.0 Min. :0.0 Min. :0.0 Min. : 1.0
1st Qu.: 450 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0 1st Qu.: 3.0
Median : 800 Median :0.0 Median :0.0 Median :0.0 Median : 5.0
Mean : 988 Mean :0.1 Mean :0.3 Mean :0.1 Mean : 5.2
3rd Qu.:1200 3rd Qu.:0.0 3rd Qu.:1.0 3rd Qu.:0.0 3rd Qu.: 7.0
Max. :9400 Max. :1.0 Max. :1.0 Max. :1.0 Max. :21.0
NA’s :3202733 NA’s :740715 NA’s :740715 NA’s :740715 NA’s :740715
MV HHL VEH
Min. :1.0 Min. :1.0 Min. :0.0
1st Qu.:3.0 1st Qu.:1.0 1st Qu.:1.0
Median :4.0 Median :1.0 Median :2.0
Mean :4.2 Mean :1.3 Mean :1.8
3rd Qu.:6.0 3rd Qu.:1.0 3rd Qu.:2.0
Max. :7.0 Max. :5.0 Max. :6.0
NA’s :1355171 NA’s :1355137 NA’s :1355137
Data summary
Name mainData
Number of rows 7487361
Number of columns 25
_______________________
Column type frequency:
numeric 25
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ST 0 1.00 27.83 15.91 1 12 27 42 56 ▇▅▆▇▆
VALP 3139781 0.58 276504.72 383572.87 100 100000 180000 320000 6308000 ▇▁▁▁▁
LAPTOP 1355137 0.82 1.21 0.41 1 1 1 1 2 ▇▁▁▁▂
HISPEED 2625724 0.65 1.15 0.36 1 1 1 1 2 ▇▁▁▁▂
FULP 1355137 0.82 115.57 498.35 1 2 2 2 7800 ▇▁▁▁▁
GASP 1355137 0.82 46.32 71.80 1 3 20 60 640 ▇▁▁▁▁
TYPE 0 1.00 1.15 0.48 1 1 1 1 3 ▇▁▁▁▁
TAXP 3202733 0.57 33.89 19.96 1 19 31 50 68 ▆▇▇▅▇
WIF 3402406 0.55 1.46 0.89 0 1 2 2 3 ▃▆▁▇▂
HINCP 1355137 0.82 80345.21 88145.46 -21500 28600 57000 100200 3209000 ▇▁▁▁▁
GRNTP 5761729 0.23 1070.96 612.04 4 670 940 1336 5022 ▇▅▁▁▁
RNTP 5660370 0.24 920.68 597.22 4 520 790 1200 4000 ▇▆▁▁▁
FINCP 3402406 0.55 94503.35 95461.60 -21500 39000 70000 116030 3164000 ▇▁▁▁▁
RMSP 740715 0.90 5.99 2.43 1 4 6 7 30 ▇▅▁▁▁
NP 0 1.00 2.10 1.50 0 1 2 3 20 ▇▁▁▁▁
ADJINC 0 1.00 1036534.06 16851.15 1011189 1029257 1035988 1045195 1061971 ▇▇▇▇▇
ACR 2170040 0.71 1.31 0.57 1 1 1 1 3 ▇▁▂▁▁
INSP 3202733 0.57 987.69 976.79 0 450 800 1200 9400 ▇▁▁▁▁
FTAXP 740715 0.90 0.09 0.29 0 0 0 0 1 ▇▁▁▁▁
FHINCP 740715 0.90 0.30 0.46 0 0 0 1 1 ▇▁▁▁▃
FINSP 740715 0.90 0.14 0.34 0 0 0 0 1 ▇▁▁▁▁
YBL 740715 0.90 5.24 3.19 1 3 5 7 21 ▇▅▁▁▁
MV 1355171 0.82 4.23 1.85 1 3 4 6 7 ▆▅▅▇▇
HHL 1355137 0.82 1.33 0.79 1 1 1 1 5 ▇▁▁▁▁
VEH 1355137 0.82 1.84 1.09 0 1 2 2 6 ▇▇▃▁▁

Graphical Summary of the data

Above are the tabular and histogram summaries of the subset I extracted from the full data; I will work on this subset to create insightful graphics and insights.

From the above it is clear that the data contains a lot of null values.

Note: I have used the skimr package (skimr handles different data types and returns a skim_df object which can be included in a tidyverse pipeline or displayed nicely for the human reader) and the pander package for this human-readable summary of the data.
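For readers outside R, the same kind of summary (missing count and complete rate per column, as in the skimr table above) can be sketched in Python with pandas; the tiny DataFrame here is a made-up stand-in for mainData.

```python
import pandas as pd

# Toy stand-in for mainData (hypothetical values; the real data has 7,487,361 rows).
df = pd.DataFrame({"VALP": [100000, None, 180000, None],
                   "NP":   [1, 2, 3, 4]})

# skimr-style summary: n_missing and complete_rate per column.
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "complete_rate": 1 - df.isna().mean(),
})
print(summary)  # VALP: 2 missing, 0.5 complete; NP: 0 missing, 1.0 complete
```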

Insightful graphical and tabular summaries of the data

The overall make-up of the data I chose to use

The tabular view above shows the first 10 rows of the data; the sky-blue rectangles indicate cells with no value present.

This data is rooted in the American Community Survey. The American Community Survey collects data on an ongoing basis, January through December, to provide every community with the information it needs to make important decisions. New data is released every year, in the form of estimates, in a variety of tables, tools, and analytical reports.

After going through all the major parts of this data, I decided to work on the ‘Housing Data’ (the American Community Survey publishes both Housing and Population data). I think I might be the only one working on the ‘Housing Data’. The major reason is that I want to learn more about the housing schemes and housing information in the United States.

This Housing data provides insights into the demographics of the United States of America. Through it you will learn about both the data profiles and the narrative profiles of housing particulars and facts.

Once I decided to work on the Housing data, I started mining for the columns I would like to work with. After a couple of days of scrutinizing the columns, I decided to work on:

“ST”, “VALP”,“LAPTOP”,“HISPEED”,“FULP”,“GASP”,“TYPE”,“TAXP”,“WIF”,“HINCP”,“GRNTP”,“RNTP”,“FINCP”,“RMSP”,“NP”,“ADJINC”,“ACR”,“INSP”,“FTAXP”,“FHINCP”,“FINSP”,“YBL”,“MV”,“HHL”,“VEH”.

The columns above cover almost all the possible angles of this data. They also reflect the economic, social and demographic make-up of the people of the United States.

Meaningful pre-processing : Which columns contain the most unwanted and unusual values

When working with huge data, it is one’s responsibility to filter the junk information out of the data. Doing this not only cleans up the data but also yields precise (accurate) results from the processing. It also helps reduce false positives and false negatives in the analysis.

I have created a circular bar plot (one of its own kind) which shows which columns contain the most unuseful information. The bigger the bar, the more null values and outliers the column contains.

I have explicitly handled those null values and outliers during processing, which produces good visual results. The three columns TYPE, NP and ADJINC are missing from this plot since they do not contain any null values. I have explicitly checked these columns and they carry important constraints which should be mandatory.

Meaningful pre-processing : Which columns contain the most NULL values (NA) - Dynamic representation of percentages (Plotly)

This is a dynamic plot: by hovering over the columns of the graph you get a clear description of the data. You can also filter the TRUE and FALSE values by clicking the TRUE/FALSE label on the right. The image can be saved with one click, with no need to write any code, and you can also zoom in and out of the graph.

Null values are one of the biggest threats to any analysis. Many times, results deviate from accuracy simply because of the presence of null values.

Before we apply any algorithm to our data, the data should be tidy and structured. But in the real world, the data we initially see is mostly unstructured. So in order to make it tidy, and to then apply any algorithm to derive insights, the data has to be cleaned. The major reason the data is not tidy is the presence of missing values and outliers.

In the above analysis, I have created a dynamic bar plot which gives an in-depth description of the null-value percentage present in each column. This plot clearly tells which column contains what percentage of actual values and what percentage of null values.

2. Methodology

To perform this analysis I have taken the housing data and covered all the major columns which give precise and detailed information about housing patterns across the United States of America. Since this data comes with property information, resource consumption, and expenditure information, I have focused explicitly on analyses that yield information related to housing. In this project I have also covered the variability of each economic variable and shed light on its usage and its influence on other important variables.

Also, this data comes with demographic information, which helps predict future outcomes in terms of transportation and land prices, and I have also discussed the amenities people in the United States currently make use of.

Since this data also contains state information, it was quite natural to discuss patterns in the data across different states. I have compared the states on various statistics, which eventually produced jaw-dropping results that I was initially not expecting, but which turned out to be true.

In following the methodology, these questions arose:

  • How did you deal with missing values? What impact does your approach have on the interpretation or generalizability of the resulting analysis?

  • How did you deal with outliers?

  • How did you deal with weights & income adjustment values?

  • Did you produce any tables or plots that you thought would reveal interesting trends but didn’t?

  • What’s the analysis that you finally settled on? What relationships do you investigate in the final analysis?
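On the income-adjustment question: per the data dictionary, ADJINC carries 6 implied decimal places, so a constant-dollar income is HINCP * ADJINC / 1,000,000. A quick sketch (Python; the analysis itself is in R, and the example uses the 2013 factor from the dictionary with the median HINCP from the summary above):

```python
# ADJINC has 6 implied decimal places:
# constant-dollar income = HINCP * ADJINC / 1_000_000
hincp = 57000          # median household income from the summary table
adjinc = 1061971       # 2013 adjustment factor from the data dictionary
adjusted = hincp * adjinc / 1_000_000
print(round(adjusted, 2))  # 60532.35
```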

3. Findings

Following are the findings I came up with (tabular summaries, graphical summaries, all the predictive outputs, and statistical significance):

Analysis-1

## # A tibble: 51 x 2
##       ST Avg_Value
##    <int>     <dbl>
##  1     1   165098.
##  2     2   236745.
##  3     4   241961.
##  4     5   142313.
##  5     6   584500.
##  6     8   342265.
##  7     9   397463.
##  8    10   286793.
##  9    11   666619.
## 10    12   263859.
## # … with 41 more rows

Property Value in the United States

Working on the housing data should definitely shed light on the value of property.

Fact worth Noting - If you add the value of all the homes in the United States together, you get a sum that’s a lot to get your mind around: $31.8 trillion. It’s more than 1.5 times the Gross Domestic Product of the United States and approaching three times that of China.

In this analysis I am examining the states with higher property values. I had some prior expectations about which states would stand out (of course New York should score high; if not, something is wrong, because New York is considered one of the most expensive places on earth).

A statewise map gives a good visualization here. Note that I am calculating the average value. Darker regions on the map mean higher values (darkest means maximum value).
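The per-state averages shown in the tibble above boil down to a grouped mean; an illustrative Python/pandas sketch on made-up values (the analysis itself is in R):

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis averages VALP per state over all rows.
df = pd.DataFrame({"ST":   [1, 1, 6, 6],
                   "VALP": [160000, 170196, 584500, None]})

# Drop missing property values, then average per state (cf. the tibble above).
avg_value = df.dropna(subset=["VALP"]).groupby("ST")["VALP"].mean()
print(avg_value.loc[1], avg_value.loc[6])  # 165098.0 584500.0
```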

Observation comes Out

Bingo !!
My expectation was right: from this map we can easily see that the average value of houses in New York is much higher than in other states. The clustering of values also helps us spot the highest values.

Analysis-2

Laptop Users with High Speed Internet.

We all know that the United States is at the top in terms of both creating technology and using it.

It is exciting to check whether the people of the United States have good, high-speed internet service or whether they are actually struggling.

Being a first-world country, I am assuming that more than 95% of the population should have excellent internet service.

Observation comes Out

I have plotted a pie chart which shows my assumption was almost wrong. About 87% of the United States population enjoys high-speed internet service, whereas more than 10% of the population is still struggling to get it.
This is the era of 2019; the US should do something for these people, and I hope it actually starts to solve this issue.

Analysis-3

Natural Resources is in Danger!

Fuel and gas consumption is one of the biggest threats to natural resources, since they are being depleted day by day. It is also well known that the United States of America is among the countries with the highest consumption of natural resources. To check this, I have performed a small analysis.

To examine the above statement, I have plotted a bar chart which shows the fuel and gas consumption by state in the USA. This eventually helps us find out why the figure is so high in a specific state.

Here I am assuming that the state with maximum fuel consumption is also the state with the maximum number of automobiles and factories. However, this graph does not give concrete evidence that the automobile figure is high for a specific state.

Analysis-4

House Tax is Quite Affordable

Depending on where you live, property taxes can be a small inconvenience or a major burden.

Fact worth noting - More than $14 billion in property taxes go unpaid each year in the United States.

Since I am working on household data, it is essential to find out how much tax (specifically, house tax) a citizen of the United States has to add to his/her yearly expenditure list. It is worth noting that house tax depends entirely on the location of the house and the state you live in.

I expect that the amenities present in the house and the size of the house do not play a vital role in calculating the house tax.

Based on this link (https://www.businessinsider.com/average-property-taxes-every-us-state#51-alabama-1): Alabama has the lowest property taxes in the US. Now is the time to check that.

Observation comes Out

States near New York, and other states such as California and Illinois, to name a few, have somewhat higher average property tax costs compared to other states in the USA.

And yes! The link was right: after plotting this map we can confirm that Alabama has the lowest property taxes in the US.

Analysis-5

##        WIF HINCP GRNTP RNTP FINCP
## WIF   1.00  0.33  0.22 0.20  0.35
## HINCP 0.33  1.00  0.49 0.48  0.98
## GRNTP 0.22  0.49  1.00 0.98  0.48
## RNTP  0.20  0.48  0.98 1.00  0.48
## FINCP 0.35  0.98  0.48 0.48  1.00
## 
## n= 893525 
## 
## 
## P
##       WIF HINCP GRNTP RNTP FINCP
## WIF        0     0     0    0   
## HINCP  0         0     0    0   
## GRNTP  0   0           0    0   
## RNTP   0   0     0          0   
## FINCP  0   0     0     0

Relationship between the Economic Entities in the data

The correlation coefficient is a statistical measure that calculates the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no relationship between the movement of the two variables.

The equation for the correlation

\(\rho_{XY} = \dfrac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y}\)
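This definition, the covariance of X and Y divided by the product of their standard deviations, can be checked numerically; a small sketch in Python on synthetic data (the analysis itself uses R):

```python
import numpy as np

# Synthetic correlated data.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

# rho_XY = cov(X, Y) / (sigma_X * sigma_Y)
rho = np.cov(x, y, ddof=0)[0, 1] / (x.std(ddof=0) * y.std(ddof=0))

# Matches numpy's built-in Pearson correlation.
assert abs(rho - np.corrcoef(x, y)[0, 1]) < 1e-12
```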

WIF - Workers in family during the past 12 months
HINCP - Household income
GRNTP - Gross rent
RNTP - Monthly rent
FINCP - Family income

The columns above show the true colors of the economic condition of a family and a house. In order to check whether they are actually correlated with each other, we have performed a correlation check between these columns.

Observation comes Out from this analysis

I have explicitly plotted ellipses which show the type of correlation. The ellipses and the figures confidently state that there is a significant correlation among these variables, and that it is a positive correlation.

We can also specify the significance level using the parameter sig.level = .01 in order to discard unwanted correlations.

Analysis-6

## 
##  F test to compare two variances
## 
## data:  predict.data$RMSP and predict.data$VALP
## F = 0.000000000036489, num df = 4347579, denom df = 4347579,
## p-value < 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.00000000003644095 0.00000000003653797
## sample estimates:
##  ratio of variances 
## 0.00000000003648941
## [1] 0.2500915
## [1] 1.367129
## [1] 6.586461

## 
## Call:
## lm(formula = valueofhouse ~ RMSP, data = predict.data.test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -389784 -129011  -31586   50375  632753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   422720      93541   4.519  0.00013 ***
## RMSP          -18546       5768  -3.215  0.00358 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239500 on 25 degrees of freedom
## Multiple R-squared:  0.2925, Adjusted R-squared:  0.2642 
## F-statistic: 10.34 on 1 and 25 DF,  p-value: 0.003579
Estimate Std. Error t value Pr(>|t|)
(Intercept) 422720.22 93541.050 4.519088 0.0001296
RMSP -18545.58 5767.988 -3.215260 0.0035795
## [1] 134.3137

Linear Regression - Predict the Value of the House by the Number of Rooms Available.

Linear regression describes the relationship between a response variable (or dependent variable) of interest and one or more predictor (or independent) variables.
The simple linear regression is used to predict a quantitative outcome y on the basis of one single predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a function of the x variable.

Once, we built a statistically significant model, it’s possible to use it for predicting future outcome on the basis of new x values.

Formula and basics The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where:

  • b0 and b1 are known as the regression beta coefficients or parameters: +b0 is the intercept of the regression line; that is the predicted value when x = 0. +b1 is the slope of the regression line.

  • e is the error term (also known as the residual errors), the part of y that cannot be explained by the regression model

  • The sum of the squares of the residual errors is called the Residual Sum of Squares or RSS.

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as minimal as possible. This method of determining the beta coefficients is technically called least squares regression or ordinary least squares (OLS) regression.
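For a single predictor, the least-squares coefficients described above have a closed form: b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x). A small Python sketch on made-up (RMSP, VALP)-like values (illustrative only; the report fits the model with R's lm()):

```python
import numpy as np

# Synthetic stand-in for (rooms, house value) pairs.
x = np.array([3.0, 4.0, 5.0, 6.0, 8.0])
y = np.array([410000.0, 350000.0, 330000.0, 300000.0, 270000.0])

# OLS closed form: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.cov(x, y, ddof=0)[0, 1] / x.var(ddof=0)
b0 = y.mean() - b1 * x.mean()

# Same coefficients as numpy's least-squares line fit.
slope, intercept = np.polyfit(x, y, 1)
assert abs(b1 - slope) < 1e-4 and abs(b0 - intercept) < 1e-4
```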

Correlation Coefficient and Variance Equality The correlation coefficient measures the level of the association between two variables x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).
Compute the correlation coefficient between the two variables using the R function cor(). In our case the correlation coefficient is 0.2500915, which suggests a weak positive correlation (a value between -0.15 and 0.15 would indicate essentially none).
The F test above (p-value < 2.2e-16) indicates that the two variables do not have equal variances.

Computation

The linear model equation can be written as follow: VALP = b0 + b1 * RMSP

Interpretation

+the estimated regression line equation can be written as follows: VALP = 422720 + (-18546)*RMSP

+the intercept (b0) is 422720. It can be interpreted as the predicted house value (VALP) when RMSP, the number of rooms, is zero.

+the regression beta coefficient for the variable RMSP (b1), also known as the slope, is -18546.

Coefficients significance

t-statistic and p-values:

For a given predictor, the t-statistic (and its associated p-value) tests whether or not there is a statistically significant relationship between a given predictor and the outcome variable, that is whether or not the beta coefficient of the predictor is significantly different from zero.

The statistical hypotheses are as follow:

Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship between x and y)
Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is some relationship between x and y) Mathematically, for a given beta coefficient (b), the t-test is computed as t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b. The t-statistic measures the number of standard deviations that b is away from 0. Thus a large t-statistic will produce a small p-value.
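Using the RMSP row of the coefficient table above, this t-statistic can be recomputed directly; a quick Python check:

```python
# t = (b - 0) / SE(b), using the RMSP row of the coefficient table.
estimate, std_error = -18545.58, 5767.988
t = estimate / std_error
print(round(t, 3))  # -3.215, matching the t value in the summary output
```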

The higher the t-statistic (and the lower the p-value), the more significant the predictor. The symbols to the right visually specify the level of significance. The line below the table shows the definition of these symbols; one star means 0.01 < p < 0.05. The more stars beside the variable’s p-value, the more significant the variable.

A statistically significant coefficient indicates that there is an association between the predictor (x) and the outcome (y) variable. In our case the p-value is 0.00358, which indicates RMSP is a useful predictor of VALP. Also, the t-statistic deviates from 0, so it clearly shows that the beta coefficient of the predictor is significantly different from zero.

Model accuracy

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as goodness-of-fit.

The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:

  • The Residual Standard Error (RSE).
  • The R-squared (R2)
  • F-statistic
rse r.squared f.statistic p.value
1 239500 0.2642 10.34 0.003579

Residual standard error (RSE). The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observations points around the fitted regression line. This is the standard deviation of residual errors.

RSE provides an absolute measure of the patterns in the data that can’t be explained by the model. When comparing two models, a smaller RSE is a good indication that the model fits the data better.

Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.
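A quick sketch of this calculation (Python; the mean house value used here is an assumed figure for illustration, not a number taken from the data):

```python
# Prediction error rate = RSE / mean(outcome).
rse = 239500            # residual standard error from the model summary
mean_valp = 280000      # assumed average house value, for illustration only
error_rate = rse / mean_valp
print(f"{error_rate:.1%}")  # 85.5%
```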

In our example, RSE = 239500, meaning that the observed house values deviate from the true regression line by approximately 239500 units on average.

R-squared and Adjusted R-squared: The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.

The R2 measures, how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient.

F-Statistic: The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient. In fact, for simple regression the F test is identical to the square of the t test: 10.34 = (-3.215)^2.

In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table. A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 10.34, producing a p-value of 0.003579, which is highly significant.
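This identity is easy to verify from the summary output; a quick Python check:

```python
# For simple linear regression, F equals the square of the slope's t-statistic.
t_value = -3.215260          # RMSP t value from the summary output
f_statistic = t_value ** 2
print(round(f_statistic, 2))  # 10.34
```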

Analysis-7

## [1] -0.002725119

Does Household Income Depict the Standard of Living in the United States?

Correlation coefficients are used in statistics to measure how strong a relationship is between two variables.
The correlation coefficient of two variables in a data set equals to their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Here we are putting in the effort to check whether the household income of a family actually correlates with its living style.
It is quite interesting to check whether people in the United States with higher household incomes have a habit of living in extra-spacious areas (big houses).
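The definition above can be verified directly in R; the vectors here are invented toy values, not PUMS columns:

```r
# Toy vectors (hypothetical incomes and room counts), illustrating that
# Pearson's r is the covariance over the product of standard deviations.
x <- c(41000, 62000, 87000, 120000, 54000)
y <- c(4, 6, 5, 7, 5)

cov(x, y) / (sd(x) * sd(y))    # manual Pearson correlation
cor(x, y, method = "pearson")  # same value from the built-in function
```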

Observations from this analysis

For the statistical examination we first remove the outliers and the null values from the data, which results in a better prediction.

To check the correlation we have used the Pearson correlation, and this correlation is computed on the population data (the population correlation coefficient).
Applying the Pearson correlation gives a correlation value of -0.002725119, which indicates a negligible relationship between the variables.
The graph clearly shows that there is essentially no relation between the household income and the house sizes.

Analysis-8

How much Americans Care About Their Houses.

It will be interesting to check whether the people of the United States actually worry about their expenditure on house-related issues or house insurance. Since the data contains the type of property and the house insurance information, we can easily check whether people actually spend on caring for their houses or just treat it as a secondary expenditure.
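A hedged sketch of that check (invented rows; the real analysis would use mainData with the INSP insurance and HINCP income columns):

```r
library(dplyr)

# Hypothetical households: yearly insurance cost as a share of income.
toy <- data.frame(HINCP = c(60000, 85000, 120000),
                  INSP  = c(900, 1200, 1500))
shares <- toy %>% mutate(insurance.share = INSP / HINCP)
shares
```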

Fact Worth Noting - The average homeowners insurance premium rose by 3.6 percent in 2015, following a 3.3 percent increase in 2014, according to a January 2018 study by the National Association of Insurance Commissioners. The average renters insurance premium fell 1.1 percent in 2015 after rising 1.1 percent in 2014.

Observations from this analysis

Americans usually do not spend much money on house insurance. They do not even invest a quarter of their income in it. Maybe the reason behind that is the routine mortgage payments, which are growing day by day.

We have also checked the top 5 richest states of the United States, and there is no significant change for those states either.

Also, we can say that the finding about the decline in investment towards house insurance is concrete and proven.

Analysis-9

Value of Houses based On their Construction Year.

It has been noticed that the median value of a house in 1950 was around 7,400 USD, but the shocking part is that the median value of a home in the USA now (in 2017) is around 221,800 USD, which is around 30 times the 1950 value.

It clearly shows that the construction year also matters a lot. However, the size and the location also play a vital role when it comes to buying a home in the United States.

Since our data contains both the value of the home and the construction year, we can confirm the above statement and check whether the construction year actually matters a lot when it comes to price.
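The check can be sketched as a group-by on the construction year; the rows below are invented, while the real chunk would group mainData by YBL (year built) and average VALP:

```r
library(dplyr)

# Toy rows standing in for the PUMS YBL (year built) and VALP (value) columns.
toy <- data.frame(YBL  = c(1950, 1950, 1990, 1990, 2017, 2017),
                  VALP = c(70000, 90000, 180000, 220000, 300000, 320000))
year.value <- toy %>%
  group_by(YBL) %>%
  summarise(Avg_Value = mean(VALP, na.rm = TRUE))
year.value
```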

Observations from this analysis

From the graph it is clearly shown that the value of a house also depends on the year of construction. From the diagram we can say that the average price of houses built between 1940 and 1990 was around 200,000 USD, which has drastically increased to 310,000 USD (average price) in the last decade. It also shows that houses built in 2017 have the highest prices. So Americans should be ready to loosen their purse strings when they even consider buying a home.

Analysis-10

Linguistic Demographics versus Years of Living.

America is known as a home for people of many cultures, and people of different cultures speak different languages (not at their workplaces, but at home or among their own community).

Most people living in the United States use English as their language at home. Here we try to find out which other languages people commonly use at home, which eventually tells us about the demographics.
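A minimal sketch of that count (invented language codes; the real analysis filters mainData on MV, years living in the house, and counts HHL, the household language):

```r
library(dplyr)

# Toy household-language codes (e.g. 1 = English, 2 = Spanish, 3 = other).
toy <- data.frame(HHL = c(1, 1, 2, 1, 2, 3))
lang.counts <- toy %>% count(HHL, sort = TRUE)
lang.counts
```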

Observations from this analysis

From this analysis it is clearly shown that the maximum number of people who have been living here for the past 20 years or more use English as their primary language at home. One more point that comes out of this analysis is that the USA has a large community of Spanish speakers among all other languages.

Analysis-11

## [1] FALSE
##      FINCP             VALP        
##  Min.   :     0   Min.   :   1700  
##  1st Qu.: 43600   1st Qu.:  70000  
##  Median : 62780   Median : 122500  
##  Mean   : 84840   Mean   : 173947  
##  3rd Qu.:106138   3rd Qu.: 215500  
##  Max.   :444000   Max.   :1843000
## Support Vector Machines with Linear Kernel 
## 
## 72 samples
##  1 predictor
## 42 classes: '1700', '12000', '13000', '20000', '25000', '30000', '35000', '40000', '50000', '60000', '65000', '70000', '75000', '80000', '84000', '90000', '96000', '100000', '103000', '115000', '120000', '125000', '130000', '135000', '140000', '150000', '160000', '180000', '190000', '200000', '215000', '217000', '220000', '225000', '250000', '300000', '350000', '375000', '400000', '500000', '750000', '829000' 
## 
## Pre-processing: centered (1), scaled (1) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 66, 63, 67, 66, 61, 64, ... 
## Resampling results:
## 
##   Accuracy    Kappa     
##   0.03821225  0.02033985
## 
## Tuning parameter 'C' was held constant at a value of 1
##  [1] 150000 125000 125000 125000 125000 250000 125000 125000 40000  125000
## [11] 40000  125000 40000  125000 125000 125000 150000 125000 125000 125000
## [21] 150000 40000  125000 125000 125000 125000 250000 40000 
## 42 Levels: 1700 12000 13000 20000 25000 30000 35000 40000 50000 ... 829000

Predicting the Value of the House by Family Income Using a Support Vector Machine.

SVM (Support Vector Machine) is a supervised machine learning algorithm which is mainly used to classify data into different classes. Unlike most algorithms, SVM makes use of a hyperplane which acts as a decision boundary between the various classes. How does it work on our data? It draws a decision boundary, i.e. a hyperplane, between any two classes in order to separate or classify them. The basic principle behind SVM is to draw the hyperplane that best separates the two classes.

Usage: To implement the Support Vector Machine (SVM) we need to install the package called caret. The caret package, short for Classification And REgression Training, has tons of functions that help build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, unsupervised learning algorithms, etc.

For usage I divide my data into train and test sets. I convert the target variable into a factor to get the accuracy based on the prediction. I have also declared the train control method, since the computational power of my machine is not robust.

To perform the SVM prediction I took a sample of records, since computation on the full data takes a lot of memory and also needs strong, robust processing power. The resampling method used is “repeatedcv”, with a fold count of 10.

The SVM process predicts the testing data with an accuracy of about 4% (it started at 1.88%), computed on the sample. This can definitely be improved (taking that 1.88% to 4.2% and beyond by repeatedly retraining the model) under a robust computing infrastructure.
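The training call described above can be sketched as follows, assuming the caret and kernlab packages are installed; the built-in iris dataset stands in for the PUMS sample, since the real chunk is much larger:

```r
library(caret)

set.seed(1)
# Same resampling and pre-processing as described above: 10-fold CV
# repeated 3 times, with centering and scaling of the single predictor.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit <- train(Species ~ Sepal.Length, data = iris,
             method = "svmLinear",
             preProcess = c("center", "scale"),
             trControl = ctrl)
predict(fit, head(iris))
```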

Analysis-12

State Wise - Number of Workers Present in a House

Please note: this graph is created with the plotly package and is fully dynamic. Hover your mouse over the graph and you can see the statistics just by pointing at a bar. You can also save the chart with one click, zoom in and out within the graph, and filter the data by clicking on the WIF entries in the legend.

Since the household data contains information on the number of workers in a house, it is interesting to find out which state has the maximum number of single-worker households and which state has households with more than two workers.

This finding also helps to relate the earnings of a household to the number of workers in it.
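A minimal sketch of such a hover-enabled chart (toy counts, not the report's WIF-by-state data), assuming the plotly package is installed:

```r
library(ggplot2)
library(plotly)

# Invented per-state worker counts; ggplotly() wraps the static bar chart
# into an interactive widget with hover, zoom and click-to-save controls.
toy <- data.frame(ST = c("California", "Texas", "New York"),
                  workers = c(148630, 90000, 80000))
p <- ggplot(toy, aes(x = ST, y = workers)) + geom_col()
ggplotly(p)
```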

Observations from this analysis

From the above dynamic graph it is clearly shown that:

  • The maximum number of single-worker families is in California, with a count of 148,630.
  • The minimum number of single-worker families is in the District of Columbia, with a count of only 2,005.
  • Most interestingly, the largest number of zero-worker families is also in California, with a count of 65,051, and the smallest number of zero-worker families is in the District of Columbia, with a count of 883.

Analysis-13

## 
##  Pearson's Chi-squared test
## 
## data:  tbltest
## X-squared = 912600, df = 18, p-value < 0.00000000000000022

CHI-SQUARE : Finding the Relationship between the Number of Workers in a House and the Vehicles they Own

Performing the chi-square test here - the chi-square test is a statistical method to determine whether two categorical variables have a significant association. Both variables should come from the same population and should be categorical.
In this test we particularly check the p-value. Moreover, like all statistical tests, we set up a null hypothesis and an alternative hypothesis.
The main thing is that we reject the null hypothesis if the resulting p-value is less than a predetermined significance level, usually 0.05; in our case it is less than the significance level, so we can reject the null hypothesis.

We create a contingency table to perform the test.
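The test can be sketched on a small invented contingency table (standing in for the WIF-by-VEH table):

```r
# Rows: two worker counts; columns: three vehicle counts (invented data).
tbl <- matrix(c(120, 60, 40,
                 30, 90, 80), nrow = 2, byrow = TRUE)
res <- chisq.test(tbl)
res$statistic  # X-squared
res$p.value    # reject the null at 0.05 if this is smaller
```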

Observations from this analysis

We have a high chi-squared value and a p-value below the 0.05 significance level, so we reject the null hypothesis and conclude that WIF and VEH have a significant relationship. This suggests that families with more members actually need more cars as a means of transportation (although this can also depend on the household income).

Analysis-14

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4     670     940    1071    1336    5022

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  gross.rent$GRNTP
## D = 0.11121, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
## 
##  One Sample t-test
## 
## data:  gross.rent$GRNTP
## t = 126.56, df = 1725631, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 1012
## 95 percent confidence interval:
##  1070.051 1071.877
## sample estimates:
## mean of x 
##  1070.964

T-Test : Compare the mean of the Gross Rent to the Standard Rent

In this analysis we compare the mean of the gross rent against the standard gross rent in the United States these days. To do that we use a t-test, specifically the one-sample t-test.

Fact noted: from https://www.deptofnumbers.com/rent/us/ I learned that the median gross rent in the United States is $1012.

So we can check whether this figure actually holds here. We take the mu value to be the median gross rent of the United States of America.

What is a One-Sample T-test

In simple words, a one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ).

Assumption for performing the t-test - the data must be normally distributed. We test normality here with the Kolmogorov-Smirnov test rather than the Shapiro-Wilk test, because our sample size is very large and the Shapiro-Wilk test accepts at most 5000 values.
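Both tests can be sketched on simulated rents (mu = 1012 as above; the real analysis runs them on gross.rent$GRNTP):

```r
set.seed(7)
# Simulated gross rents, roughly matching the sample mean and spread above.
rent <- rnorm(10000, mean = 1071, sd = 590)

# Normality check with the one-sample Kolmogorov-Smirnov test.
ks.test(rent, "pnorm", mean = mean(rent), sd = sd(rent))

# One-sample t-test against the published median gross rent.
t.test(rent, mu = 1012)
```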

Summary of the data

  • Min.: the minimum value
  • 1st Qu.: The first quartile. 25% of values are lower than this.
  • Median: the median value. Half the values are lower; half are higher.
  • 3rd Qu.: the third quartile. 75% of values are lower than this.
  • Max.: the maximum value

T-test Results

From the results we have:

  • t is the t-test statistic value (t = 126.56),
  • df is the degrees of freedom (df = 1725631),
  • p-value is the significance level of the t-test (p-value < 2.2e-16),
  • conf.int is the confidence interval of the mean at 95% (conf.int = [1070.051, 1071.877]),
  • sample estimates is the mean value of the sample (mean = 1070.964).

Interpretation of the result

The p-value of the test is less than 2.2e-16, which is below the significance level alpha = 0.05. We can conclude that the mean gross rent in our data is significantly different from the published median gross rent in the United States.

Analysis-15

Neural Network - Predicting the Value of the Property by Corresponding relative Predictors

Basics of Neural Network

  • A neural network is a model characterized by an activation function, which is used by interconnected information processing units to transform input into output. A neural network has often been compared to the human nervous system: information is passed through interconnected units, analogous to the passage of information through neurons in humans.

  • The first layer of the neural network receives the raw input, processes it and passes the processed information to the hidden layers.

  • The hidden layers pass the information to the last layer, which produces the output. The advantage of a neural network is that it is adaptive in nature: it learns from the information provided, i.e. it trains itself on data with a known outcome and optimizes its weights for better prediction in situations with an unknown outcome.

  • A perceptron, viz. single layer neural network, is the most basic form of a neural network. A perceptron receives multidimensional input and processes it using a weighted summation and an activation function.

Our case - the objective is to predict VALP from variables such as INSP, FINCP, FULP and GASP.

This tells us whether these variables are enough to predict the value of a house, which is quite helpful for a newcomer to the area. It can eventually be used to estimate VALP and to support town planning as well, since it speaks to the proximity of resources in the area.
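A hedged sketch of such a network, assuming the neuralnet package is installed; the data is simulated and pre-scaled, while the real chunk trains on the PUMS columns named above:

```r
library(neuralnet)

set.seed(3)
# Simulated, pre-scaled predictors standing in for INSP, FINCP, FULP, GASP.
toy <- data.frame(INSP = runif(50), FINCP = runif(50),
                  FULP = runif(50), GASP = runif(50))
toy$VALP <- 0.4 * toy$FINCP + 0.2 * toy$INSP + rnorm(50, sd = 0.05)

# One hidden layer with two units; linear output for a regression target.
nn <- neuralnet(VALP ~ INSP + FINCP + FULP + GASP,
                data = toy, hidden = 2, linear.output = TRUE)
head(predict(nn, toy))
```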

With an error rate of approximately 1.84 we can say that some of these variables are very helpful for predicting the value of the house, while others introduce randomness into the data and act as outliers here.

4. Discussion

After completing the various noteworthy and significant analyses, I have come up with various outcomes and results which clearly describe the housing pattern in the United States of America and distinguish each variable's role in maintaining the economy. Through this scrutiny, I got to know more about the demographic information of the United States of America.

Following are the underlying conclusions that come out of this analysis:

  1. From the state-wise analysis, I confirm that New York is the state with the maximum land value. Initially I had only heard this, but now I have concrete evidence to prove it. New York, followed by California, is the most expensive state in terms of land value, which is worth noting for someone working on the economic sector of the United States.

  2. From the analysis I realize that many people in the United States are still craving good internet speed, which is worth noting for the telecommunications sector. Also, natural resource consumption varies from state to state; while not hugely influential, we now have a rough idea about it.

  3. Since this is housing data, there is a clear opportunity to examine house-related tax. From this analysis I conclude that the state with the maximum land value is also the state with the highest house tax. Although I was a bit shocked to learn that Louisiana charges less in housing tax (a state of 4.66 million Americans).

  4. Using the predictive analysis I confirmed that the economic variables in the data are quite important to one another (by finding the correlation). The impact on one can mirror onto another, which was not obvious. In the predictive section I predicted whether the number of rooms actually impacts the house value. Confirming the relation, I can say that it matters in many scenarios: getting an extra room in your house in the United States of America can loosen up your pocket. I have also found that the household income does not decide the area of the house the family lives in.

  5. From the analysis I found that Americans do not care much about their houses and invest barely a tenth of their salary in house insurance. In fact I checked this for five major states, including New York, and did not get notably different results, so even New York is no exception.

  6. I also got to know that the year of construction also decides the value of the house, which is not obvious in every country. Houses constructed after 2015 have the maximum value of the decade. Also, the United States is home to a lot of Spanish speakers; Spanish-speaking households are the second-largest group in the United States after English-speaking ones.

  7. After the analysis, I was fascinated to learn that many houses in the United States of America have zero workers at home. Though there is some chance of a discrepancy here, because surviving in a country like the United States without earning is quite difficult (almost impossible).

  8. Using the chi-square test, I found that houses with larger families also own the most cars, which suggests that the people of the United States are concerned about transportation, though not necessarily about living space.

Are the models you fit believable?

I believe that the models I fit and used are the right models to work with, but my only concern is training them recursively so that they yield acceptable accuracy (which they sometimes lack). Another concern is the machine itself, because most of the model training is done on cloud-based technology.

How much confidence do you have in your analysis? Do you believe your conclusions? Are you confident enough in your analysis and findings to present them to policy makers?

I am around 80%-85% confident and sure about the conclusions I have constructed, though there is always room for improvement. The conclusions are quite helpful for policy simulation and creation in a country. I completely believe in the effort I have put in and also believe it is quite helpful for a viewer. I am confident about my outputs and results here, although I need to work more on the data-plotting part.

5. Overall Code and Appendix

Code-1 : Reading the data here inside this chunk

# reading the data set only

library(data.table)
library(pander)
library(kableExtra)
library(skimr)


# start_time <- Sys.time()
# mainData <- fread("MainData.csv")
# end_time <- Sys.time()


#end_time - start_time
options(scipen=10000)

start_time <- Sys.time()

# selecting the column on which I have to work upon
cols <- c("ST", "VALP","LAPTOP","HISPEED","FULP","GASP","TYPE","TAXP","WIF","HINCP","GRNTP","RNTP","FINCP","RMSP","NP","ADJINC","ACR","INSP","FTAXP","FHINCP","FINSP","YBL","MV","HHL","VEH")

# using the fread method of the data.table package to read the huge data in less time.
mainData <- fread("MainData.csv",select = cols)
end_time <- Sys.time()

# calculating the time of reading here
end_time - start_time


# getting the fancy summary of data using pander package
pander(summary(mainData))

# getting the summary plus histogram of data using the skim package
skim(mainData)

Code-2 : Getting the insight of data here

library(kableExtra)
library(dplyr)
library(magrittr)
library(DT)
options(DT.options = list(pageLength = 30))


top10.mainData <- mainData[1:10]
top10.mainData %>%
    datatable(options = list(dom = "t", ordering = FALSE), 
              rownames = FALSE,
              width = 30) %>%
    formatStyle(c("ST", "VALP","LAPTOP","HISPEED","FULP","GASP","TYPE","TAXP","WIF","HINCP","GRNTP","RNTP","FINCP","RMSP","NP","ADJINC","ACR","INSP","FTAXP","FHINCP","FINSP","YBL","MV","HHL","VEH"), backgroundColor = styleEqual(NA, "skyblue"))

Code-3 : Checking the unwanted and null values present in the data

library(dplyr)
library(tidyverse)
library(ggdark)
library(ggplot2)



missing.values <- mainData %>%
    gather(key = "key", value = "val") %>%
    mutate(is.missing = is.na(val)) %>%
    group_by(key, is.missing) %>%
    summarise(num.missing = n()) %>%
    filter(is.missing==T) %>%
    select(-is.missing) 

missing.values$id <- seq(1,21)


label_data <- missing.values
number_of_bar <- nrow(label_data)
angle <-  90 - 360 * (label_data$id) /number_of_bar
label_data$hjust<-ifelse( angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)

ggplot(missing.values, aes(x=as.factor(id), y=num.missing)) +       # Note that id is a factor. If x is numeric, there is some space between the first bar
  
  # This add the bars with a blue color
  geom_bar(stat="identity", fill=alpha("yellow", 1)) +
  
  # Limits of the plot = very important. The negative value controls the size of the inner circle, the positive one is useful to add size over each bar
  ylim(-1,120) +
  
  # Custom the theme: no axis title and no cartesian grid
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-1,4), "cm")      # Adjust the margin to make in sort labels are not truncated!
  ) +
  
  # This makes the coordinate polar instead of cartesian.
  coord_polar(start = 0) +
  
  # Add the labels, using the label_data dataframe that we have created before
  geom_text(data=label_data, aes(x=id, y=num.missing+10, label=key, hjust=hjust), color="White", fontface="bold", alpha=0.6, size=2.5, angle=label_data$angle, inherit.aes = FALSE) + dark_theme_void() + labs(title = "Circular Bar Plot of missing values")

Code-4 : Calculating the Null values and creating the insight

library(dplyr)
library(ggplot2)
library(plotly)
library(tidyverse)
library(ggdark)


missing.values.percentage <- mainData %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)




levels <-
    (missing.values.percentage  %>% filter(isna == T) %>% arrange(desc(pct)))$key

percentage.plot <- missing.values.percentage %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna , width=.8), 
                 stat = 'identity', alpha=0.35 , position = position_dodge() , colour="black") +
      scale_x_discrete(limits = levels) +
      scale_fill_manual(name = "", 
                        values = c('purple3', 'gold1'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")


percentage.ploting <- ggplotly(percentage.plot)
percentage.ploting

Code-5 : Calculating and plotting the average property value in USA by states

library(dplyr)
library(ggplot2)
library(ggdark)

valuebystate <- mainData %>% select(ST , VALP)
plotdata <- valuebystate %>% group_by(ST) %>% filter(!any(is.na(ST))) %>% summarise(Avg_Value = mean(VALP,na.rm = TRUE))

statewise.landValue <- valuebystate %>% group_by(ST) %>% filter(!any(is.na(ST))) %>% summarise(Avg_Value = mean(VALP,na.rm = TRUE))
statewise.landValue
tb <- (plotdata$ST)
tbNorm <- (plotdata$Avg_Value)

require ("ggplot2")
require ("choroplethr")
require ("choroplethrMaps")


states <- data.frame ( c (1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51,
53, 54, 55, 56, 72),
    c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
    "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
    "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
    "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota",
    "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
    "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
    "Washington", "West Virginia", "Wisconsin", "Wyoming", "Puerto Rico"))
    
names (states) = c ("Index", "State")

# takes a table across states as input and prints it on a map
# uses the "choroplethr" package
plot.map <- function (tb, title = "", legend = "") {
 
  i <- match (names (tb), as.character(states$Index))
  states <- data.frame (tb, tolower(states$State[i]))
  states <- states [,c (3,2)]
  
  # this is required by choroplethr
  names (states) <- c ("region", "value")
  states$region <- as.character (states$region)
  
  # this is identical to state_choroplethr except that the labels are being removed
  c = StateChoropleth$new(states)
  c$set_num_colors(7)
  c$title  = title
  c$legend = legend
  c$set_zoom(NULL)
  c$show_labels = FALSE
  c$render()
}

geographical.rep.table <- as.table(setNames(tbNorm,tb))
plot.map (geographical.rep.table, "Average Property Value By States in US", "Cluster of Land Value") + theme(plot.title = element_text(hjust = 0.5)) + dark_theme_linedraw() + theme(legend.background = element_rect(fill="black",
                                  size=0.5, linetype="solid", 
                                  colour ="white"))

Code-6 : Calculating and plotting the Laptop users with High speed interent availablity

library(dplyr)
library(ggplot2)

user.laptop.hispeed <- mainData %>% filter(LAPTOP != 'b' , LAPTOP == 1 , HISPEED != 'NA') %>% select(LAPTOP , HISPEED)
user.laptop.hispeed.pie <- user.laptop.hispeed %>% group_by(LAPTOP,HISPEED) %>% summarise(COUNT = n()) %>% mutate(lab.ypos = cumsum(COUNT) - 0.5*COUNT)
user.laptop.hispeed.pie$HISPEED <- factor(user.laptop.hispeed.pie$HISPEED , labels = c("Yes" , "No"))


ggplot(user.laptop.hispeed.pie, aes(x = "", y = COUNT , fill = HISPEED)) +
  geom_bar(width = 1, stat = "identity", color = "black") +
  coord_polar("y", start = 0) + 
  theme_void() + geom_text(aes(label = paste0(COUNT, " (", scales::percent(COUNT / sum(COUNT)),")")),
position = position_stack(vjust = 0.8) , check_overlap = T , size = 3.5) + labs(title = "Desktop Users With HighSpeed Internet Service" , fill = "Hispeed Connectivity") + theme(legend.text = element_text(face = "italic", colour="steelblue4",family = "Helvetica"),legend.title = element_text(colour = "steelblue",  face = "bold.italic", family = "Helvetica")) + theme(
  legend.box.background = element_rect(),
  legend.box.margin = margin(6, 6, 6, 6)
)

Code-7 : Calculating and plotting the natural resources consumption in united states

library(tidyverse)

# Natural Resources Consumption by States
FuelCost <- mainData %>% select(ST , FULP , GASP) %>% filter(FULP > 2 , GASP >2)
FuelCostPlot <- FuelCost %>% group_by(ST) %>% summarise(Avg_Fuel_Usage = mean(FULP , na.rm = TRUE) , Avg_Gas_Usage = mean(GASP , na.rm = TRUE))


FuelCostPlot$ST <- factor(FuelCostPlot$ST)
levels(FuelCostPlot$ST) <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
    "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
    "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
    "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota",
    "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
    "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
    "Washington", "West Virginia", "Wisconsin", "Wyoming", "Puerto Rico")

state.wise.consumption.data <- FuelCostPlot %>%
  gather("Stat", "Value", -ST)

ggplot(state.wise.consumption.data, aes(x = ST, y = Value, fill = Stat)) +
  geom_col(position = "dodge") + coord_flip() +scale_fill_manual(values=c("#D823DE", "#23DEA0"))  + theme(legend.box.background = element_rect(),legend.box.margin = margin(6, 6, 6, 6)) + labs(title = "Natural Resources Consumption by States") + theme(axis.title.x = element_blank(),axis.text.y=element_text(size=rel(0.8))) + dark_theme_linedraw() + xlab("State Names") +ylab("Consumption Values") + theme(legend.background = element_rect(fill="black",
                                  size=0.5, linetype="solid", 
                                  colour ="white"))

Code-8 : Calculating and plotting the average house tax by states in the United States

library(mapproj)
library(ggplot2)
library(ggdark)


states_data_map <- map_data("state")

Houses.only <- mainData %>% filter(TYPE == 1 , FTAXP == 1) %>% select(ST , TAXP)


avg.tax <- Houses.only %>% group_by(ST) %>% summarise(avg.mean.tax = mean(TAXP , na.rm = T))

ST<-c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
region<-c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')

states.data <- data.frame(ST,region)

states.data$ST <- as.factor(as.character(states.data$ST))
avg.tax$ST <- as.factor(as.character(avg.tax$ST))

common.data <-   merge(avg.tax,states.data,by="ST")
mapcreation  <-  merge(states_data_map, common.data, by="region")

centroids <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
centroids$abb<-state.abb[match(centroids$region,tolower(state.name))]


ggplot(mapcreation, aes(x = long, y = lat, group = group, fill = avg.mean.tax)) + 
coord_map("gilbert") +
theme_light() +
scale_fill_continuous(low="Green",high="red",limits=c(min(mapcreation$avg.mean.tax), max(mapcreation$avg.mean.tax))) +
labs(title = "Average House Tax (Property) Figures - By States",fill="Average\nProperty Tax\n Costs($)") + geom_polygon(colour = "black") + theme(strip.background = element_blank(), strip.text.x = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks  = element_blank(), axis.line   = element_blank(), panel.border= element_blank(), panel.grid  = element_blank(), legend.position = "right") + xlab("") + ylab("") + with(centroids,  ggplot2::annotate(geom="text", x = long, y=lat, label = abb, 
size = 3,color="black",family="Times")) + theme(legend.box.background = element_rect(),legend.box.margin = margin(6, 6, 6, 6)) + theme(plot.title = element_text(hjust = 0.5)) + dark_theme_linedraw() + theme(legend.background = element_rect(fill="black",
                                  size=0.5, linetype="solid", 
                                  colour ="white"))

Code-9 : Checking and plotting the Economic entities in the United States

# Correlation between all the economic-aspect columns, which shows the basic income and expenditure in a household.
require(ggpubr)
require(tidyverse)
require(Hmisc)
require(corrplot)
library(ellipse)
library(RColorBrewer)
corr.components <- mainData %>% select(WIF , HINCP , GRNTP , RNTP , FINCP )
corr.components <- na.omit(corr.components)


round(cor(corr.components), 2)
rcorr(as.matrix(corr.components))

M<-cor(corr.components)
corrplot(M, method = "ellipse",col=brewer.pal(n=8, name="PuOr"))
corrplot(M, method = "number",col=brewer.pal(n=8, name="PuOr"))

Code-10 : Predict the Value of the House by the Number of Rooms Available

library(moments)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(broom)
library(ggdark)


# Extracting the relevant columns and dropping missing values
predict.data <- mainData %>% filter(!is.na(RMSP), !is.na(VALP)) %>% select(RMSP, VALP)

# testing the variance of dataset
var.test(predict.data$RMSP,predict.data$VALP)


# checking the correlation between the variables after the filtering.
cor(predict.data$VALP , predict.data$RMSP)


# Checking the normality

#checking the skewness of RMSP column
skewness(predict.data$RMSP)

#checking the skewness of VALP column
skewness(predict.data$VALP)

ggplot(predict.data, aes(x=RMSP)) + 
 geom_histogram(aes(y=..density..), colour="darkblue", fill="lightblue")+
 geom_density(alpha=.2, fill="#FF6666" , color = 'black') + labs(title="Histogram for Room Number") +
  labs(x="Number of Rooms", y="Desnity Values") + theme(plot.title = element_text(hjust = 0.5))  + geom_vline(aes(xintercept=mean(RMSP)),
            color="blue", linetype="dashed", size=1)


ggplot(predict.data, aes(x=VALP)) + 
 geom_histogram(aes(y=..density..), colour="darkblue", fill="red")+
 geom_density(alpha=.2, fill="#FF6666",color = 'black') + labs(title="Histogram for Value of House") +
  labs(x="Value of House", y="Desnity Values") + theme(plot.title = element_text(hjust = 0.5))  + geom_vline(aes(xintercept=mean(RMSP)),
            color="red", linetype="dashed", size=1)

# checking the relation by visualization
ggplot(predict.data , aes(x = VALP , y = RMSP)) + geom_point(color = "black") + stat_smooth()

# note: n() counts the number of records per room number, not a house value
predict.data.test <- predict.data %>% group_by(RMSP) %>% dplyr::summarise(valueofhouse = n())

lm.method.predict <- lm(VALP ~ RMSP, data =
             predict.data)

lm.method.predict.test <- lm(valueofhouse ~ RMSP, data =
             predict.data.test)


summary(lm.method.predict.test)

lm.method.predict.summary <- summary(lm.method.predict.test)

kable(lm.method.predict.summary$coefficients) %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0:2, bold = T, color = "black", background = "skyblue")

sigma(lm.method.predict)*100/mean(predict.data$VALP)

qqnorm(lm.method.predict$residuals, col = "yellow")
qqline(lm.method.predict$residuals, col = "blue")

lm.method.predict.test.metrics <- augment(lm.method.predict.test)

ggplot(lm.method.predict.test.metrics, aes(RMSP, valueofhouse)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = RMSP, yend = .fitted), color = "red", size = 0.3) + annotate("rect", xmin=c(3.5,15), xmax=c(9.5,26), ymin=c(250000,0) , ymax=c(750000,150000), alpha=0.2, color="white", fill="blue") + dark_theme_linedraw() + annotate("text", x=11, y=820000, label= "Maximum Residual Value \n (R.S.S value very high)") + annotate("text", x=22, y=220000, label= "Minimum Residual Value \n (R.S.S value Accurate)") + xlab("Room Space") + ylab("Value of the House")

Code-11 : Does Household Income Depict the Standard of Living in the United States?

require(ggpubr)
# Correlation between the Income and the House size

income.house.data <- mainData %>% select(HINCP , NP , ADJINC , ACR) %>% filter(!is.na(ACR))
income.house.data <- na.omit(income.house.data)

income.house.data$AdjustedHousehold_Inc<-income.house.data$HINCP*unique(income.house.data$ADJINC/1E06,incomparables = FALSE)

# flagging the outliers (computed with the boxplot rule; not removed from the data)

outliers <- boxplot(income.house.data$AdjustedHousehold_Inc, plot = FALSE)$out

# finding the correlation

plot(income.house.data$ACR,income.house.data$AdjustedHousehold_Inc,col="blue")
income.house.data.df<- income.house.data[,c("AdjustedHousehold_Inc","ACR")]
income.house.data.df1<- income.house.data.df[complete.cases(income.house.data.df),]
cor(income.house.data.df1$AdjustedHousehold_Inc,as.numeric(income.house.data.df1$ACR))

ggscatter(income.house.data.df1, x = "AdjustedHousehold_Inc", y = "ACR",
   add = "reg.line",
   add.params = list(color = "blue", fill = "lightgray"),
   conf.int = TRUE
   ) + stat_cor(method = "pearson", label.x = 3, label.y = 30)

Code-12 : How Much Americans Care About Their Houses

library(ggdark)
library(viridis)

# how much Americans care about their homes

house.insurance.care <- mainData %>% select(TYPE , FINSP , FHINCP , HINCP , INSP , ST)


# getting type of unit = 1, fire/hazard/flood insurance = 1, and household income
house.insurance.cols <-  subset(house.insurance.care,TYPE==1 & FINSP==1 & FHINCP==1) 
house.data<- house.insurance.cols[,c('ST','INSP','HINCP')]

# tapply() returns results ordered by state code, so take the states from its
# row names rather than from unique(), whose order of appearance could misalign.
insurance.data <- as.data.frame(tapply(house.data$INSP, house.data$ST, mean))
colnames(insurance.data) <- c("Avg")
dataframe.insu <- data.frame(ST = rownames(insurance.data), Avg = insurance.data$Avg)
dataframe.insu$flag <- c("Insurance")
income.data <- as.data.frame(tapply(house.data$HINCP, house.data$ST, mean))
colnames(income.data) <- c("Avg")
dataframe.inc <- data.frame(ST = rownames(income.data), Avg = income.data$Avg)
dataframe.inc$flag <- c("Income")
dataframe.insu$ST <- as.factor(as.character(dataframe.insu$ST))
dataframe.inc$ST <- as.factor(as.character(dataframe.inc$ST))

insurance.house.df <- rbind(dataframe.insu,dataframe.inc)
ST<-c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
region<-c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')
housing.states <- data.frame(ST,region)
housing.states$ST <- as.factor(as.character(housing.states$ST))
insurance.house.df$ST=as.factor(as.character(insurance.house.df$ST))

all.housing.insu <- merge(insurance.house.df,housing.states,by="ST")
all.housing.insu.top5 = subset(all.housing.insu, ST ==6 | ST == 11 | ST == 25 |ST == 36 |ST == 53 )

ggplot(data=all.housing.insu, aes(x=region, y=round(Avg), fill=flag)) +
    geom_bar(stat="identity", position=position_dodge()) + coord_flip() +
    scale_fill_manual(values=c("#23DEDB", "#DEDE23")) +
    # the complete dark theme must come before the theme() tweaks, or it overrides them
    dark_theme_linedraw() +
    theme(axis.title.x = element_blank(),
          axis.text.y = element_text(size=rel(0.8)),
          plot.title = element_text(hjust = 0.5),
          legend.box.background = element_rect(),
          legend.box.margin = margin(6, 6, 6, 6),
          legend.background = element_rect(fill="black", size=0.5,
                                           linetype="solid", colour="white"))

Code-13 : Value of Houses based On their Construction Year

library(ggplot2)
library(dplyr)
library(ggdark)

options(scipen=10000)

yearbuild.propertyval <- mainData %>% select(YBL , VALP)
yearbuild.propertyval <- na.omit(yearbuild.propertyval)
# recoding the YBL categories with a lookup vector instead of repeated replace() calls
ybl.labels <- c("1939 or earlier","1940 to 1949","1950 to 1959","1960 to 1969",
                "1970 to 1979","1980 to 1989","1990 to 1999","2000 to 2004",
                "2005","2006","2007","2008","2009","2010","2011","2012",
                "2013","2014","2015","2016","2017")
yearbuild.propertyval$YBL <- ybl.labels[as.integer(yearbuild.propertyval$YBL)]

avg.price.yearwise <- yearbuild.propertyval %>% group_by(YBL) %>% summarise(mean.value = mean(VALP , na.rm = T))

ggplot(avg.price.yearwise, aes(x=`YBL`, y=mean.value , label = "")) + 
  geom_point(stat='identity', fill="black", size=6 , color="Red" , alpha = 0.6)  +
  geom_segment(aes(y = 0, 
                   x = `YBL`, 
                   yend = mean.value, 
                   xend = `YBL`), 
               color = "white") +
  geom_text(color="white", size=3) +
  labs(title="Value of House Based on Construction Year", 
       subtitle="Does Construction Year Matters ?") + coord_flip() + dark_theme_linedraw() + xlab("Year of Build") + ylab("Average Price of House")

Code-14 : Linguistic Demographics versus Length of Residence

library(ggplot2)
library(dplyr)
library(ggdark)

move.data <- mainData %>% select(MV, HHL) %>% filter(!is.na(MV), !is.na(HHL))
move.data$MV <- replace(move.data$MV, move.data$MV == 1 , "12 months or less")
move.data$MV <- replace(move.data$MV, move.data$MV == 2 , "13 to 23 months")
move.data$MV <- replace(move.data$MV, move.data$MV == 3 , "2 to 4 years")
move.data$MV <- replace(move.data$MV, move.data$MV == 4 , "5 to 9 years")
move.data$MV <- replace(move.data$MV, move.data$MV == 5 , "10 to 19 years")
move.data$MV <- replace(move.data$MV, move.data$MV == 6 , "20 to 29 years")
move.data$MV <- replace(move.data$MV, move.data$MV == 7 , "30 years or more")


move.data$HHL <- replace(move.data$HHL, move.data$HHL == 1 , "English only")
move.data$HHL <- replace(move.data$HHL, move.data$HHL == 2 , "Spanish")
move.data$HHL <- replace(move.data$HHL, move.data$HHL == 3 , "Other Indo-European languages")
move.data$HHL <- replace(move.data$HHL, move.data$HHL == 4 , "Asian and Pacific Island languages")
move.data$HHL <- replace(move.data$HHL, move.data$HHL == 5 , "Other language")

move.data <- move.data %>% group_by(MV, HHL) %>% summarise(count.values = n())

ggplot() + geom_point(data = move.data, aes(x = MV, y = count.values, size = 5, color = HHL, shape = HHL)) + coord_flip() + dark_theme_linedraw() + xlab("Living Here Since..") + ylab("Values") + labs(title = "Migration Period Along with Linguistic Demographics")

Code-15 : Predicting the value of the house by family Income Using Support Vector Machine

library(caret)
library(dplyr)
library(kernlab)

# getting the data
f.income.house.size <- mainData %>% select(FINCP, VALP) %>% filter(!is.na(FINCP), !is.na(VALP))
# keeping only the first 100 rows to stay within machine limits
f.income.house.size <- head(f.income.house.size,100)


# partitioning the data
intrain <- createDataPartition(y = f.income.house.size$VALP, p= 0.7, list = FALSE)
training <- f.income.house.size[intrain,] 
testing <- f.income.house.size[-intrain,]
training[["VALP"]] = factor(training[["VALP"]])

# anyNA() checks whether any missing values remain
anyNA(f.income.house.size)

# summary of the data
summary(f.income.house.size)

# setting up the train control: repeated 10-fold cross-validation, which manages the computational work for caret's train() function
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)


# training the model using a linear-kernel SVM
svm_Linear <- train(VALP ~ FINCP, data = training, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)

# printing the trained model
svm_Linear

# predicting the respective values
test_pred <- predict(svm_Linear, newdata = testing)
test_pred

Code-16 : State Wise - Number of Workers Present in a House (Dynamic Graph)

library(dplyr)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(ggdark)
library(plotly)



state.workers.home <- mainData %>% select(ST, WIF) %>% filter(!is.na(WIF))
state.workers.home <- state.workers.home %>% dplyr::group_by(ST,WIF) %>% dplyr::summarise(count = n())

# Recoding the number of workers as labels.
state.workers.home$WIF <- replace(state.workers.home$WIF, state.workers.home$WIF == 0 , "Zero Workers")
state.workers.home$WIF <- replace(state.workers.home$WIF, state.workers.home$WIF == 1 , "One Worker")
state.workers.home$WIF <- replace(state.workers.home$WIF, state.workers.home$WIF == 2 , "Two Workers")
state.workers.home$WIF <- replace(state.workers.home$WIF, state.workers.home$WIF == 3 , "Three Workers")

# Mapping the state codes to state names with a lookup instead of repeated replace() calls.

ST.codes <- c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56,72)
state.names <- c('alabama','alaska','arizona','arkansas','california','colorado','connecticut','delaware','district of columbia','florida','georgia','hawaii','idaho','illinois','indiana','iowa','kansas','kentucky','louisiana','maine','maryland','massachusetts','michigan','minnesota','mississippi','missouri','montana','nebraska','nevada','new hampshire','new jersey','new mexico','new york','north carolina','north dakota','ohio','oklahoma','oregon','pennsylvania','rhode island','south carolina','south dakota','tennessee','texas','utah','vermont','virginia','washington','west virginia','wisconsin','wyoming','puerto rico')
state.workers.home$ST <- state.names[match(state.workers.home$ST, ST.codes)]


p <- ggplot(state.workers.home, aes(fill=WIF, y=count, x=ST)) + 
    geom_bar(position="stack", stat="identity") +
    scale_fill_viridis(discrete = T) +
    ggtitle("Statewise - Number of Workers Present in Home") +
    theme_ipsum() +
    xlab("State Names") + theme(axis.text.x = element_text(angle=90, hjust=1, size = 7))

ploting <- ggplotly(p)
ploting

Code-17 : Finding the Relationship between Number of Workers in a House and the Vehicles They Own

# relationship between workers in the family and the vehicles they own

library(dplyr)
library(gplots)
library(ggdark)

worker.vehicles <- mainData %>% select(WIF, VEH) %>% filter(!is.na(WIF), !is.na(VEH))
tbltest <- table(worker.vehicles$WIF, worker.vehicles$VEH)

# performing the chi square test 
chisq.test(tbltest)

balloonplot(t(tbltest), main ="Workers in House - Vehicles They Own", xlab ="Vehicles", ylab="No of Workers",
            label = FALSE, show.margins = FALSE)

#create density curve
curve(dchisq(x, df = 18), from = 0, to = 40,
main = 'Chi-Square Distribution (df = 18)',
ylab = 'Density',
lwd = 2)

Code-18 : Compare the mean of the Gross Rent to the Standard Rent

library(dplyr)
library(ggpubr)

# filtering the gross rent from the main data, dropping missing values and zeros
gross.rent <- mainData %>% select(GRNTP) %>% filter(!is.na(GRNTP), GRNTP != 0)

# summary of the data
summary(gross.rent$GRNTP)

# plotting the histogram plot -to visualize the normality around the mean
ggplot(gross.rent, aes(x=GRNTP)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="yellow" , alpha = 0.8)+
 geom_density(alpha=.2, fill="#FF6666" , color = 'green') + labs(title="Histogram for Gross Rent") +
  labs(x="Gross Rent", y="Desnity Values") + theme(plot.title = element_text(hjust = 0.5))  + geom_vline(aes(xintercept=mean(GRNTP)),
            color="blue", linetype="dashed", size=1)

# testing the normality here with qqplot
ggqqplot(gross.rent$GRNTP)


# for a large data set we use the Kolmogorov-Smirnov test rather than the Shapiro-Wilk test
ks.test(gross.rent$GRNTP , y = 'pnorm',mean = 1070.964 , sd = 612.037)

# performing the one-sample t-test here
t.test(gross.rent$GRNTP, mu = 1012)

Code-19 : Neural Network - Predicting the Value of the Property from Related Predictors

library(neuralnet)
library(dplyr)
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')

nndata <- mainData %>% select(INSP, FINCP, FULP, GASP, VALP) %>% filter(!is.na(INSP), !is.na(FINCP), !is.na(FULP), !is.na(GASP), !is.na(VALP))

nndata <- head(nndata,1000)

# Random sampling
samplesizedata = 0.60 * nrow(nndata)
set.seed(80)
indexdatann = sample( seq_len ( nrow ( nndata ) ), size = samplesizedata )

# Create training and test set
datatrain = nndata[ indexdatann, ]
datatest = nndata[ -indexdatann, ]



## Scale data for neural network

maxnn = apply(nndata , 2 , max)
minnn = apply(nndata, 2 , min)
scaled = as.data.frame(scale(nndata, center = minnn, scale = maxnn - minnn))

trainNN = scaled[indexdatann , ]
testNN = scaled[-indexdatann , ]


set.seed(4)
NN = neuralnet(VALP ~ INSP + FINCP + FULP + GASP , trainNN, hidden = 3 , linear.output = T )

plot(NN , col.out = 'blue' , fontsize = 9,col.out.synapse = "red",col.intercept = "blue" , col.entry = 'skyblue',col.entry.synapse = 'red' , rep= "best")

Package Information/Installation

In order to perform the analysis and to create the graphs and tables, I have used a couple of packages that are not part of base R and will need to be installed in most cases.

Please note that all the packages used here can be installed with one single command:

install.packages("package name")
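As a convenience, the required packages can also be installed in one call. This is a sketch: the vector below lists only a few of the packages loaded earlier in this report as an example, and you should extend it with the full list.

```r
# Hypothetical example: install only the packages that are not already present.
pkgs <- c("dplyr", "ggplot2", "corrplot", "caret", "neuralnet", "ggdark")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```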

Following are the packages you need to install:-